1 EDA

We can first visualize the data after dimension reduction with zinbWave. We then use t-SNE and color the cells with the allen labels. Since we use various values for the k parameter, we plot two representations with the most extreme values of k.

2 General comments

We have the following workflow

In the Seurat clustering, there are two parameters to choose from: the resolution and the k.param (use in a knn step). From the seurat help file, for the resolution parameter: use a value above (below) 1.0 if you want to obtain a larger (smaller) number of communities.

We picked a value of 1.6 as we want many clusters to start with. We also pick k = 50. Those values seem intermediate in term of ARI with other pairs of paramters (see section Seurat ARI param)

We also consider two ways of computing the ARI between RSEC and other methods. Either, we force RSEC to assign all cells to a given cluster, or we only compute the ARI between RSEC and the other methods on those cells that RSEC do cluster. Note that in the second case, other pairs of methods are compared using all cells. We denote as RsecT the cluster assignement where all cells are assigned (i.e Rsec Total).

The ARI merging method works as follow. We iterate over all pairs of clusters for every clustering method. For each pair, we try to merge the two clusters and see how it improves the ARI with the other methods. We then merge the pair that improves the ARI the most. We stop when the ARI cannot be improved anymore.

In the general case, we perform the ARI merging without the allen labels and we then use it as a comparison. A quick overview of how the algorithm performs with the allen labels is seen at the end.

3 Reduction in the number of clusters

4 Improvement in ARI

5 Comparison with Allen subclass

5.1 After ARI merging

5.2 After a last consensus step

5.3 Stopping during the merging

6 Seurat ARI Param

---
author: "Hector Roux de Bézieux"
date: '`r format(Sys.time(), "%d %B , %Y")`'
output:
  html_document:
    toc: true
    toc_depth: 2
    number_sections: true
    code_download: TRUE
    
params:
  dataset: "SMARTer_nuclei_MOp"
  title: "Analysis of the SMARTer_nuclei_MOp dataset"
---
---
title: `r params$title`
---

```{r load packages, include=F}
library(knitr)
opts_chunk$set(
  fig.pos = "!h", out.extra = "", warning = F, message = F,
  fig.width = 5, fig.align = "center", echo = F
)
libs <- c("here", "dplyr", "ggplot2", "tidyr", "stringr", "readr", "cowplot",
          "clusterExperiment", "mclust", "RColorBrewer", "progress", "Dune",
          "png", "purrr")
suppressMessages(
  suppressWarnings(sapply(libs, require, character.only = TRUE))
)
rm(libs)
type <- function(dataset) {
  if (str_detect(dataset, "SMART")) return("Smart-Seq")
  if (str_detect(dataset, "10x")) return("10X")
  if (str_detect(dataset, "Regev")) return("Regev")
}
type <- type(params$dataset)
mergers <- readRDS(here("data", "Dune",
                                 paste0(params$dataset, "_mergers.rds")))
```

# EDA

We can first visualize the data after dimension reduction with zinbWave. We then use t-SNE and color the cells with the allen labels. Since we use various values for the k parameter, we plot two representations with the most extreme values of k.

```{r t-SNE plots, fig.width=18, fig.height=9}
plots <- list.files(here("Figures", "EDA"))
plots <- plots[str_detect(plots, params$dataset)]
plots <- plots[str_detect(plots, "png")]
plot.new()
plot.window(0:1, 0:1)
rasterImage(readPNG(here("Figures", "EDA", plots[1])), 0, 0, .5, 1)
rasterImage(readPNG(here("Figures", "EDA", plots[length(plots)])), 0.5, 0, 1, 1)
```


# General comments

We have the following workflow

```{r workflow, fig.height=6, fig.width=10}
plot.new()
plot.window(0:1, 0:1)
rasterImage(readPNG(here("Explainations", "Workflow.png")), 0, 0, 1, 1)
```

In the Seurat clustering, there are two parameters to choose from: the resolution and the k.param (use in a knn step). From the seurat help file, for the resolution parameter: use a value above (below) 1.0 if you want to obtain a larger (smaller) number of communities.

We picked a value of 1.6 as we want many clusters to start with. We also pick k = 50. Those values seem intermediate in term of ARI with other pairs of paramters (see section Seurat ARI param)

We also consider two ways of computing the ARI between RSEC and other methods. Either, we force RSEC to assign all cells to a given cluster, or we only compute the ARI between RSEC and the other methods on those cells that RSEC do cluster. Note that in the second case, other pairs of methods are compared using all cells. We denote as RsecT the cluster assignement where all cells are assigned (i.e Rsec Total). 

The ARI merging method works as follow. We iterate over all pairs of clusters for every clustering method. For each pair, we try to merge the two clusters and see how it improves the ARI with the other methods. We then merge the pair that improves the ARI the most. We stop when the ARI cannot be improved anymore.

In the general case, we perform the ARI merging without the allen labels and we then use it as a comparison. A quick overview of how the algorithm performs with the allen labels is seen at the end.

# Reduction in the number of clusters

```{r Imp no allen}
plotPrePost(mergers)
```

# Improvement in ARI

```{r ARI imp no allen cell}
plot_grid(
  plotARIs(mergers$initialMat) + ggtitle("Before merging"),
  plotARIs(mergers$currentMat) + ggtitle("After merging")
)
```


```{r ARI trend, fig.width=9}
ARItrend(merger = mergers)
```

# Comparison with Allen subclass

## After ARI merging

```{r comp cell}
if (!str_detect(params$dataset, "Regev")) {
  allen_clusters <- read.csv(here("data", type,
                                  paste0(params$dataset, "_cluster.membership.csv")),
                             col.names = c("cells", "cluster_id"))
  clusters <- read.csv(here("data", type,
                            paste0(params$dataset, "_cluster.annotation.csv")),
                       header = T)
  allen_clusters <- full_join(allen_clusters, clusters) %>%
    arrange(cells)
  rm(clusters)
  
  final <- map_df(mergers$currentMat[order(rownames(mergers$initialMat)),],
                  adjustedRandIndex, y = allen_clusters$subclass_label) %>%
    pivot_longer(everything(), names_to = "clus_method", values_to = "ARI")
  initial <- map_df(mergers$initialMat[order(rownames(mergers$initialMat)),],
                    adjustedRandIndex, y = allen_clusters$subclass_label) %>%
    pivot_longer(everything(), names_to = "clus_method", values_to = "ARI")
  
  Monocle <- read.csv(here("data", "SingleMethod", paste0(params$dataset, "_Monocle.csv"))) %>%
    select(-X)
  colnames(Monocle) <- paste0("Monocle_", str_remove(colnames(Monocle), "^k_"))
  colnames(Monocle)[ncol(Monocle)] <- "cells"
  sc3 <- read.csv(here("data", "SingleMethod", paste0(params$dataset, "_SC3.csv"))) %>%
    select(-X)
  colnames(sc3) <- paste0("sc3_",
                          str_remove(colnames(sc3), "X") %>% str_replace("\\.", "-"))
  colnames(sc3)[ncol(sc3)] <- "cells"
  Seurat <- read.csv(here("data", "SingleMethod", paste0(params$dataset, "_Seurat.csv"))) %>%
    select(-X)
  colnames(Seurat) <- paste0("Seurat_", str_remove(colnames(Seurat), "^X"))
  colnames(Seurat)[ncol(Seurat)] <- "cells"
  
  singles <- full_join(sc3, Seurat) %>% full_join(Monocle) %>%
    arrange(cells) %>% select(-cells) %>% as.list() %>%
    map_df(., adjustedRandIndex, y = allen_clusters$subclass_label) %>%
    pivot_longer(everything(), names_to = "method", values_to = "ARI") %>%
    dplyr::mutate(clus_method = word(method, 1, sep = "_"),
                  level = word(method, 2, sep = "_"))
  ggplot(singles, aes(x = clus_method, y = ARI)) +
    geom_violin() +
    theme_classic() +
    geom_jitter(width = .01,
      data = bind_rows("Final" = final, "Initial" = initial, .id = "step"),
      aes(col = step)) +
    scale_color_brewer(type = "qual") +
    scale_y_continuous(limits = c(min(initial$ARI, final$ARI, singles$ARI),
                                  max(initial$ARI, final$ARI, singles$ARI)))
}
```

## After a last consensus step

```{r final consensus}
clusters <- mergers$currentMat
r <- which(colnames(clusters) == "RsecT")
if (all.equal(integer(0) ,r) != TRUE) {
  clusters[,"Rsec"] <- assignRsec(mergers) 
}
clusters <- as.matrix(clusters) 

cellsConsensus <- Consensus(clusMat = clusters,
                            large = (type != "Smart-Seq"))
if (type == "Regev") {
  consensus <- cbind(cellsConsensus,
                     clusters)
} else {
  consensus <- cbind(cellsConsensus,
                     allen_clusters$subclass_label,
                     clusters)
  colnames(consensus)[c(1, 2)] <- c("Consensus", "Allen\nsubclass")
}
plotClusters(object = consensus)
title("Consensus Merging with proportion\nand comparison with the allen subclass.")

rm(clusters, cellsConsensus, consensus)
```

## Stopping during the merging

```{r consensus with intermediate}
midMat <- intermediateMat(merger = mergers,
                          p = .9)
r <- which(colnames(midMat) == "RsecT")
if (all.equal(integer(0) ,r) != TRUE) {
  midMat[,"Rsec"] <- assignRsec(mergers, p = .9)
}
midMat <- as.matrix(midMat)

cellsConsensus <- Consensus(clusMat = midMat,
                            large = (type != "Smart-Seq"))
if (type == "Regev") {
  consensus <- cbind(cellsConsensus,
                     midMat)
} else {
  consensus <- cbind(cellsConsensus,
                     allen_clusters$subclass_label,
                     midMat)
  colnames(consensus)[c(1, 2)] <- c("Consensus", "Allen\nsubclass")
}
colnames(consensus)[c(1, 2)] <- c("Consensus", "Allen\nsubclass")
plotClusters(object = consensus %>% as.matrix())
title("Consensus Merging when stopping early")

rm(cellsConsensus, consensus, midMat)
```

# Seurat ARI Param

```{r seurat param}
plot.new()
plot.window(0:1, 0:1)
rasterImage(readPNG(
  here("Figures", type, paste0(params$dataset, "_Seurat_ARI.png"))),
  0, 0, 1, 1)
```
